Description

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The
copyright owner has no objection to the xeroxographic reproduction by anyone of the patent document or the patent 
disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves 
all copyright rights whatsoever. 

GOVERNMENT RIGHTS NOTICE

Portions of the material in this specification arose in the course of or under contract nos. 92E R81 275 (SBIR) between
Affymetrix, Inc. and the Department of Energy and/or H600813-1 , *2 between Affymetrix, Inc. and the National Institutes 
of Health. 
BACKGROUND OF THE INVENTION

The present invention relates to the field of computer systems. More specifically, the present invention relates to 
computer systems for visualizing biological sequences, as well as for evaluating and comparing biological sequences. 

Devices and computer systems for forming and using arrays of materials on a substrate are known. For example,
PCT applications WO92/10588 and 95/11995, incorporated herein by reference for all purposes, describe techniques
for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be
formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Patent Nos.
5,445,934 and 5,384,261, and U.S. Patent Application No. 08/249,188, each incorporated herein by reference for all
purposes.

According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known 
locations on a chip or substrate. A labeled nucleic acid is then brought into contact with the chip and a scanner generates 
an image file (also called a cell file) indicating the locations where the labeled nucleic acids bound to the chip. Based 
upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as 
the monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may
be used to study and detect mutations relevant to cystic fibrosis, the P53 gene (relevant to certain cancers), HIV, and
other genetic characteristics. 

Improved computer systems and methods are needed to evaluate, analyze, and process the vast amount of infor- 
mation now used and made available by these pioneering technologies. 


SUMMARY OF THE INVENTION 

An improved computer-aided system for visualizing and determining the sequence of nucleic acids is disclosed. 
The computer system provides, among other things, improved methods of analyzing fluorescent image files of a chip 
containing hybridized nucleic acid probes in order to call bases in sample nucleic acid sequences.

According to one aspect of the invention, a computer system is used to identify an unknown base in a sample nucleic 
acid sequence by the steps of: 

- inputting multiple probe intensities, each of the probe intensities being associated with a nucleic acid probe; 

- the computer system comparing the multiple probe intensities where each of the probe intensities is substantially
proportional to a nucleic acid probe hybridizing with at least one nucleic acid sequence; and

- calling the unknown base according to the results of the comparison of the multiple probe intensities.

According to one specific aspect of the invention, a higher probe intensity is compared to a lower probe intensity to 
call the unknown base. According to another specific aspect of the invention, probe intensities of a sample sequence
are compared to probe intensities of a reference sequence. According to yet another specific aspect of the invention, 
probe intensities of a sample sequence are compared to statistics about probe intensities of a reference sequence from 
multiple experiments. 

According to another aspect of the invention, a method is disclosed of processing reference and sample nucleic 
acid sequences to reduce the variations between the experiments by the steps of:

- providing a plurality of nucleic acid probes; 

- labeling the reference nucleic acid sequence with a first marker;

- labeling the sample nucleic acid sequence with a second marker; and

- hybridizing the labeled reference and sample nucleic acid sequences at the same time.

According to another aspect of the invention, a computer system is used to identify mutations in a sample nucleic 
acid sequence by the steps of: 

- inputting a first set of probe intensities, each of the probe intensities in said first set being associated with a nucleic
acid probe and substantially proportional to the associated nucleic acid probe hybridizing with a reference nucleic
acid sequence;

- inputting a second set of probe intensities, each of the probe intensities in said second set being associated with a
nucleic acid probe and substantially proportional to the associated nucleic acid probe hybridizing with said sample
sequence;

- the computer system comparing probe intensities in the first set to probe intensities in the second set to select 
hybridization regions where the probe intensities in the first and second sets differ; and 

- identifying mutations according to characteristics of the selected regions.

According to yet another aspect of the invention, a computer system is used for comparative analysis and visualization
of multiple sequences by the steps of:

- displaying at least one reference sequence in a first area on a display device; and 
- displaying at least one sample sequence in a second area on said display device;

whereby a user is capable of visually comparing the multiple sequences.

A further understanding of the nature and advantages of the inventions herein may be realized by reference to the 
remaining portions of the specification and the attached drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention; 

Fig. 2 shows a system block diagram of a typical computer system used to execute the software of the present
invention;

Fig. 3 illustrates an overall system for forming and analyzing arrays of biological materials such as DNA or RNA;

Fig. 4 is an illustration of the software for the overall system;

Fig. 5 illustrates the global layout of a chip formed in the overall system;

Fig. 6 illustrates conceptually the binding of probes on chips;

Fig. 7 illustrates probes arranged in lanes on a chip;

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7;

Fig. 9 illustrates the high level flow of the intensity ratio method;

Fig. 10A illustrates the high level flow of one implementation of the reference method and Fig. 10B shows an analysis
table for use with the reference method;

Fig. 11A illustrates the high level flow of another implementation of the reference method; Fig. 11B shows a data
table for use with the reference method; Fig. 11C shows a graph of the normalized sample base intensities minus
the normalized reference base intensities; and Fig. 11D shows other graphs of data in the data table;

Fig. 12 illustrates the high level flow of the statistical method;

Fig. 13 illustrates the pooling processing of a reference and sample nucleic acid sequence;

Figs. 14A and 14C show graphs of scaled fluorescent intensities of wild-type probes hybridizing with sample and
reference sequences and Fig. 14B shows a hypothetical graph of fluorescent intensities of wild-type probes hybridizing
with two sample sequences and a reference sequence;

Fig. 15 illustrates the high level flow of an embodiment that uses the hybridization data from more than one base position
to identify mutations in a sample sequence;

Fig. 16 illustrates the main screen and the associated pull down menus for comparative analysis and visualization
of multiple experiments;

Fig. 17 illustrates an intensity graph window for a selected base;

Fig. 18 illustrates multiple intensity graph windows for selected bases;

Fig. 19 illustrates the intensity ratio method correctly calling a mutation in solutions with varying concentrations;

Fig. 20 illustrates the reference method correctly calling a mutant base where the intensity ratio method incorrectly
called the mutant base; and

Fig. 21 illustrates the output of the ViewSeq™ program with four pretreatment samples and four posttreatment
samples.
DESCRIPTION OF THE PREFERRED EMBODIMENT

CONTENTS

I. General

II. Intensity Ratio Method

III. Reference Method

IV. Statistical Method

V. Pooling Processing

VI. Comparative Analysis

VII. Examples

I. General 

In the description that follows, the present invention will be described in reference to a Sun Workstation in a UNIX
environment. The present invention, however, is not limited to any particular hardware or operating system environment.
Instead, those skilled in the art will find that the systems and methods of the present invention may be advantageously
applied to a variety of systems, including IBM personal computers running MS-DOS or Microsoft Windows. Therefore,
the following description of specific systems is for purposes of illustration and not limitation.

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention. Fig. 1
shows a computer system 1 which includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may
have one or more buttons such as mouse buttons 13. Cabinet 7 houses a floppy disk drive 14 and a hard drive (not
shown) that may be utilized to store and retrieve software programs incorporating the present invention. Although a
floppy disk 15 is shown as the removable media, other removable tangible media including CD-ROM, flash memory and
tape may be utilized. Cabinet 7 also houses familiar computer components (not shown) such as a processor, memory,
and the like.

Fig. 2 shows a system block diagram of computer system 1 used to execute the software of the present invention.
As in Fig. 1, computer system 1 includes monitor 3 and keyboard 9. Computer system 1 further includes subsystems
such as a central processor 52, system memory 54, I/O controller 56, display adapter 58, serial port 62, disk 64, network
interface 66, and speaker 68. Disk 64 is representative of an internal hard drive, floppy drive, CD-ROM, flash memory,
tape, or any other storage medium. Other computer systems suitable for use with the present invention may include
additional or fewer subsystems. For example, another computer system could include more than one processor 52 (i.e.,
a multi-processor system) or memory cache.

Arrows such as 70 represent the system bus architecture of computer system 1. However, these arrows are illustrative
of any interconnection scheme serving to link the subsystems. For example, speaker 68 could be connected to
the other subsystems through a port or have an internal direct connection to central processor 52. Computer system 1
shown in Fig. 2 is but an example of a computer system suitable for use with the present invention. Other configurations
of subsystems suitable for use with the present invention will be readily apparent to one of ordinary skill in the art.

The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small
chips. See U.S. Patent No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is
incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect
complementary nucleic acid sequences in a sample nucleic acid of interest (the "target" nucleic acid).

The present invention provides methods of analyzing hybridization intensity files for a chip containing hybridized
nucleic acid probes. In a representative embodiment, the files represent fluorescence data from a biological array, but
the files may also represent other data such as radioactive intensity data or large molecule detection data. Therefore,
the present invention is not limited to analyzing fluorescent measurements of hybridizations but may be readily utilized
to analyze other measurements of hybridization.

For purposes of illustration, the present invention is described as being part of a computer system that designs a
chip mask, synthesizes the probes on the chip, labels the nucleic acids, and scans the hybridized nucleic acid probes.
Such a system is fully described in U.S. Patent Application No. 08/249,188 which has been incorporated by reference
for all purposes. However, the present invention may be used separately from the overall system for analyzing data
generated by such systems.

Fig. 3 illustrates a computerized system for forming and analyzing arrays of biological materials such as RNA or
DNA. A computer 100 is used to design arrays of biological polymers such as RNA or DNA. The computer 100 may be,
for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC
equivalent including appropriate memory and a CPU as shown in Figs. 1 and 2. The computer system 100 obtains
inputs from a user regarding characteristics of a gene of interest, and other inputs regarding the desired features of the
array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an
external or internal database 102 such as GenBank. The output of the computer system 100 is a set of chip design
computer files 104 in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other
associated computer files.

The chip design files are provided to a system 106 that designs the lithographic masks used in the fabrication of
arrays of molecules such as DNA. The system or process 106 may include the hardware necessary to manufacture
masks 110 and also the computer hardware and software 108 necessary to lay the mask patterns out on the
mask in an efficient manner. As with the other features in Fig. 3, such equipment may or may not be located at the same
physical site, but is shown together for ease of illustration in Fig. 3. The system 106 generates masks 110 or other
synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

The masks 110, as well as selected information relating to the design of the chips from system 100, are used in a
synthesis system 112. Synthesis system 112 includes the necessary hardware and software used to fabricate arrays of
polymers on a substrate or chip 114. For example, synthesizer 112 includes a light source 116 and a chemical flow cell
118 on which the substrate or chip 114 is placed. Mask 110 is placed between the light source and the substrate/chip,
and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip.
Selected chemical reagents are directed through flow cell 118 for coupling to deprotected regions, as well as for washing
and other operations. All operations are preferably directed by an appropriately programmed computer 119, which may
or may not be the same computer as the computer(s) used in mask design and mask making.

The substrates fabricated by synthesis system 112 are optionally diced into smaller chips and exposed to marked
receptors. The receptors may or may not be complementary to one or more of the molecules on the substrate. The
receptors are marked with a label such as a fluorescein label (indicated by an asterisk in Fig. 3) and placed in scanning
system 120. Scanning system 120 again operates under the direction of an appropriately programmed digital computer
122, which also may or may not be the same computer as the computers used in synthesis, mask making, and mask
design. The scanner 120 includes a detection device 124 such as a confocal microscope or CCD (charge-coupled
device) that is used to detect the locations where labeled receptor (*) has bound to the substrate. The output of scanner
120 is an image file(s) 124 indicating, in the case of fluorescein labeled receptor, the fluorescence intensity (photon
counts or other related measurements, such as voltage) as a function of position on the substrate. Since higher photon
counts will be observed where the labeled receptor has bound more strongly to the array of polymers, and since the
monomer sequence of the polymers on the substrate is known as a function of position, it becomes possible to determine
the sequence(s) of polymer(s) on the substrate that are complementary to the receptor.

The image file 124 is provided as input to an analysis system 126 that incorporates the visualization and analysis
methods of the present invention. Again, the analysis system may be any one of a wide variety of computer system(s),
but in a preferred embodiment the analysis system is based on a Sun Workstation or equivalent. The present invention
provides various methods of analyzing the chip design files and the image files, providing appropriate output 128. The
present invention may further be used to identify specific mutations in a receptor such as DNA or RNA.

Fig. 4 provides a simplified illustration of the overall software system used in the operation of one embodiment of
the invention. As shown in Fig. 4, in some cases (such as sequence checking systems) the system first identifies the
genetic sequence(s) or targets that would be of interest in a particular analysis at step 202. The sequences of interest
may, for example, be normal or mutant portions of a gene, genes that identify heredity, or provide forensic information,
or be all possible n-mers (where n represents the length of the nucleic acid). Sequence selection may be provided via
manual input of text files or may be from external sources such as GenBank. At step 204 the system evaluates the gene
to determine or assist the user in determining which probes would be desirable on the chip, and provides an appropriate
"layout" on the chip for the probes. The chip usually includes probes that are complementary to a reference nucleic acid
sequence which has a known sequence. A wild-type probe is a probe that will ideally hybridize with the reference
sequence and thus a wild-type gene (also called the chip wild-type) would ideally hybridize with wild-type probes on the
chip. The target sequence is substantially similar to the reference sequence except for the presence of mutations,
insertions, deletions, and the like. The layout implements desired characteristics such as arrangement on the chip that permits
"reading" of genetic sequence and/or minimization of edge effects, ease of synthesis, and the like.

Fig. 5 illustrates the global layout of a chip in a particular embodiment used for sequence checking applications.
Chip 114 is composed of multiple units where each unit may contain different tilings for the chip wild-type sequence.
Unit 1 is shown in greater detail and shows that each unit is composed of multiple cells which are areas on the chip that
may contain probes. Conceptually, each unit is composed of multiple sets of related cells. As used herein, the term cell
refers to a region on a substrate that contains many copies of a molecule or molecules of interest. Each unit is composed
of multiple cells that may be placed in rows (or "lanes") and columns. In one embodiment, a set of five related cells
includes the following: a wild-type cell 220, "mutation" cells 222, and a "blank" cell 224. Cell 220 contains a wild-type
probe that is the complement of a portion of the wild-type sequence. Cells 222 contain "mutation" probes for the
wild-type sequence. For example, if the wild-type probe is 3'-ACGT, the probes 3'-ACAT, 3'-ACCT, 3'-ACGT, and 3'-ACTT
may be the "mutation" probes. Cell 224 is the "blank" cell because it contains no probes (also called the "blank" probe).
As the blank cell contains no probes, labeled receptors should not bind to the chip in this area. Thus, the blank cell
provides an area that can be used to measure the background intensity.
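
The relationship between a wild-type probe and its "mutation" probes can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mutation_probes` and the 1-based `position` argument are assumptions made for the example, and the substitution position (the third base) is inferred from the example probes listed above.

```python
BASES = "ACGT"

def mutation_probes(wild_type_probe, position):
    """Return the four probes obtained by placing each of A, C, G, T at
    `position` (1-based, counting left to right) of the wild-type probe.
    One of the four is always identical to the wild-type probe."""
    i = position - 1
    return [wild_type_probe[:i] + base + wild_type_probe[i + 1:] for base in BASES]

# Wild-type probe 3'-ACGT with the third base varied
probes = mutation_probes("ACGT", 3)
print(probes)
```

Together with the wild-type cell and a blank cell, these four probes would make up one set of related cells in the layout described above.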
In one embodiment, numerous tiling processes are available including sequence tiling, block tiling, and opt-tiling,
as described below. Of course a wide range of layout strategies may be used according to the invention herein, without 
departing from the scope of the invention. For example, the probes may be tiled on a substrate in an apparently random 
fashion where a computer system is utilized to keep track of the probe locations and correlate the data obtained from 
the substrate. 

Opt-tiling is the process of tiling additional probes for suspected mutations. As a simple example of opt-tiling, suppose 
the wild-type target sequence is 5'-ACGTATGCA-3' and it is suspected that a mutant sequence has a possible T base 
mutation at the underlined base position. Suppose further that the chip will be synthesized with a "4x3" tiling strategy, 
meaning that probes of four monomers are used and that the monomers in position 3, counting left to right, of the probe 
are varied. 

In opt-tiling, extra probes are tiled for each suspected mutation. The extra probes are tiled as if the mutation base 
is a wild-type base. The following shows the probes that may be generated for this example: 



Table 1

Probe Sequences (From 3'-end), 4x3 Opt-Tiling

Wild      TGCA   GCAT   CATA   ATAC   TACG
A sub.    TGAA   GCAT   CAAA   ATAC   TAAG
C sub.    TGCA   GCCT   CACA   ATCC   TACG
G sub.    TGGA   GCGT   CAGA   ATGC   TAGG
T sub.    TGTA   GCTT   CATA   ATTC   TATG

Wild      TGCA   GCAA   CAAA   AAAC   AACG
A sub.    TGAA   GCAA   CAAA   AAAC   AAAG
C sub.    TGCA   GCCA   CACA   AACC   AACG
G sub.    TGGA   GCGA   CAGA   AAGC   AAGG
T sub.    TGTA   GCTA   CATA   AATC   AATG

In the first "chip" above, the top row of the probes (along with one probe below each of the four wild-type probes) should 
bind to the target DNA sequence. However, if the target sequence has a T base mutation as suspected, the labeled 
mutant sequence will not bind that strongly to the probes in the columns around column 3. For example, the mutant 
receptor that could bind with the probes in column 2 is 5'-CGTT which may not bind that strongly to any of the probes
in column 2 because there are T bases at the ends of the receptor and probes (i.e., not complementary). This often 
results in a relatively dark scanned area around a mutation. 

Opt-tiling generates the second "chip" above to handle the suspected mutation as a wild-type base. Thus, the mutant 
receptor 5'-CGTT should bind strongly to the wild-type probe of column 2 (along with one probe below) and the mutation
can be further detected. 
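
The construction of the two "chips" in Table 1 can be sketched as follows. This is a minimal illustration, not the layout software itself: the function `tile`, its parameters, and the choice to emit only the first five columns are assumptions made to match the table, and the suspected T mutation is assumed to fall at the fifth base of the target, which is what the second chip in Table 1 implies.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"

def tile(target, probe_len=4, sub_pos=3, n_cols=5):
    """Tile a 5'->3' target with probes written from the 3'-end.
    Each column holds the wild-type probe for one window of the target plus
    the four probes with A, C, G, T substituted at `sub_pos` (1-based)."""
    rows = {label: [] for label in ["Wild"] + [b + " sub." for b in BASES]}
    n_windows = len(target) - probe_len + 1
    for start in range(min(n_cols, n_windows)):
        window = target[start:start + probe_len]
        wild = "".join(COMPLEMENT[b] for b in window)
        rows["Wild"].append(wild)
        for b in BASES:
            rows[b + " sub."].append(wild[:sub_pos - 1] + b + wild[sub_pos:])
    return rows

target = "ACGTATGCA"                    # wild-type target, 5'->3'
mutant = target[:4] + "T" + target[5:]  # assumed suspected mutation: fifth base -> T

chip1 = tile(target)  # ordinary 4x3 tiling of the wild-type target
chip2 = tile(mutant)  # opt-tiling: the suspected mutant is tiled as if wild-type

for label in chip1:
    print(f"{label:8s}", " ".join(chip1[label]), "  |  ", " ".join(chip2[label]))
```

Treating the suspected mutant as its own wild-type sequence is what gives the mutant receptor a perfectly complementary probe in every column of the second chip.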

Again referring to Fig. 4, at step 206 the masks for the synthesis are designed. At step 208 the software utilizes the 
mask design and layout information to make the DNA or other polymer chips. This software 208 will control relative 
translation of a substrate and the mask, the flow of desired reagents through a flow cell, the synthesis temperature of 
the flow cell, and other parameters. At step 210, another piece of software is used in scanning a chip thus synthesized 
and exposed to a labeled receptor. The software controls the scanning of the chip, and stores the data thus obtained in 
a file that may later be utilized to extract sequence information. 

At step 212 a computer system according to the present invention utilizes the layout information and the fluorescence
information to evaluate the hybridized nucleic acid probes on the chip. Among the important pieces of information 
obtained from probe arrays are the identification of mutant receptors and determination of genetic sequence of a par- 
ticular receptor. 

Fig. 6 illustrates the binding of a particular target DNA to an array of DNA probes 114. As shown in this simple
example, the following probes are formed in the array (only one probe is shown for the wild-type probe): 
3'-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT

As shown, the set of probes differ by only one base so the probes are designed to determine the identity of the base at
that position in the nucleic acid sequence.

When a fluorescein-labeled (or otherwise marked) target with the sequence 5'-TCTTGCA is exposed to the array,
it is complementary only to the probe 3'-AGAACGT, and fluorescein will be primarily found on the surface of the chip
where 3'-AGAACGT is located. Thus, for each set of probes that differ by only one base, the image file will contain four
fluorescence intensities, one for each probe. Each fluorescence intensity can therefore be associated with the base of 
each probe that is different from the other probes. Additionally, the image file will contain a "blank" cell which can be 
used as the fluorescence intensity of the background. By analyzing the five fluorescence intensities associated with a 
specific base location, it becomes possible to extract sequence information from such arrays using the methods of the 
invention disclosed herein. 
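
The complementarity in this example can be checked with a short Python sketch. The helper name `complementary_probe` is an assumption made for illustration; probes are written 3'->5' and the target 5'->3', so the two strings pair position by position.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complementary_probe(target_5_to_3):
    """Return the perfectly complementary probe, written 3'->5'."""
    return "".join(COMPLEMENT[b] for b in target_5_to_3)

probes = ["AGAACGT", "AGACCGT", "AGAGCGT", "AGATCGT"]  # written 3'->5'
target = "TCTTGCA"                                      # written 5'->3'

match = complementary_probe(target)

# Positions (0-based) at which the probes in the set differ from one another
varied = [i for i in range(len(probes[0])) if len({p[i] for p in probes}) > 1]

print(match, match in probes)             # the one probe expected to fluoresce
print(varied, [target[i] for i in varied])  # interrogated position and target base
```

Only one probe in the set matches the target at the varied position, which is why the brightest of the four cells identifies the base at that position.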

Fig. 7 illustrates probes arranged in lanes on a chip. A reference sequence is shown with five interrogation positions 
marked with number subscripts. An interrogation position is a base position in the reference sequence where the target 
sequence may contain a mutation or otherwise differ from the reference sequence. The chip may contain five probe cells 
that correspond to each interrogation position. Each probe cell contains a set of probes that have a common base at 
the interrogation position. For example, at the first interrogation position, the reference sequence has a base T. The
wild-type probe for this interrogation position is 3'-TGAC where the base A in the probe is complementary to the base 
at the interrogation position in the reference sequence. 

Similarly, there are four "mutant" probe cells for the first interrogation position, I1. The four mutant probes are
3'-TGAC, 3'-TGCC, 3'-TGGC, and 3'-TGTC. Each of the four mutant probes varies by a single base at the interrogation
position. As shown, the wild-type and mutant probes are arranged in lanes on the chip. One of the mutant probes (in
this case 3'-TGAC) is identical to the wild-type probe and therefore does not evidence a mutation. However, the
redundancy gives a visual indication of mutations as will be seen in Fig. 8.

Still referring to Fig. 7, the chip contains wild-type and mutant probes for each of the other interrogation positions
I2-I5. In each case, the wild-type probe is equivalent to one of the mutant probes.

Fig. 8 illustrates a hybridization pattern of a target on a chip with a reference sequence as in Fig. 7. The reference 
sequence is shown along the top of the chip for comparison. The chip includes a WT-lane (wild-type), an A-lane, a C- 
lane, a G-lane, and a T-lane (or U). Each lane is a row of cells containing probes. The cells in the WT-lane contain probes 
that are complementary to the reference sequence. The cells in the A-, C-, G-, and T-lanes contain probes that are 
complementary to the reference sequence except that the named base is at the interrogation position. 

In one embodiment, the hybridization of probes in a cell is determined by the fluorescent intensity (e.g., photon 
counts) of the cell resulting from the binding of marked target sequences. The fluorescent intensity may vary greatly 
among cells. For simplicity, Fig. 8 shows a high degree of hybridization by a cell containing a darkened area. The WT-
lane allows a simple visual indication that there is a mutation at interrogation position I4 because the wild-type cell is not
dark at that position. The cell in the C-lane is darkened which indicates that the mutation is from T->G (mutant probe
cells are complementary so the C-cell indicates a G mutation).

In practice, the fluorescent intensities of cells near an interrogation position having a mutation are relatively dark 
creating "dark regions" around a mutation. The lower fluorescent intensities result because the cells at interrogation 
positions near a mutation do not contain probes that are perfectly complementary to the target sequence; thus, the 
hybridization of these probes with the target sequence is lower. For example, the relative intensity of the cells at
interrogation positions I3 and I5 may be relatively low because none of the probes therein are complementary to the target
sequence.
For ease of reference, one may call bases by assigning the bases the following codes:



Code   Group               Meaning

A      A                   Adenine
C      C                   Cytosine
G      G                   Guanine
T      T(U)                Thymine (Uracil)
M      A or C              aMino
R      A or G              puRine
W      A or T(U)           Weak interaction (2 H bonds)
Y      C or T(U)           pYrimidine
S      C or G              Strong interaction (3 H bonds)
K      G or T(U)           Keto
V      A, C or G           not T(U)
H      A, C or T(U)        not G
D      A, G or T(U)        not C
B      C, G or T(U)        not A
N      A, C, G, or T(U)    Insufficient intensity to call
X      A, C, G, or T(U)    Insufficient discrimination to call

Most of the codes conform to the IUPAC standard. However, code N has been redefined and code X has been added. 
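
As an illustration, the code table can be held as a simple lookup. The sketch below is an assumption about representation only (the names `CODES` and `code_for` are hypothetical); codes N and X are kept separate because, as redefined here, both cover all four bases and differ only in the reason no call is made.

```python
# Group of bases for each code (T stands for T or U)
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "Y": "CT", "S": "CG", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}

# Codes that make no call; the group is all four bases in both cases
NO_CALL = {
    "N": "insufficient intensity to call",
    "X": "insufficient discrimination to call",
}

def code_for(bases):
    """Return the code whose group is exactly the given set of one to three bases."""
    key = "".join(sorted(set(bases.upper().replace("U", "T"))))
    for code, group in CODES.items():
        if group == key:
            return code
    raise ValueError(f"no single code for bases {bases!r}")

print(code_for("GA"), code_for("CG"), code_for("TCG"))
```

A lookup keyed on the set of candidate bases lets a base-calling routine report an ambiguity code whenever it cannot narrow the call to a single base.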
II. Intensity Ratio Method

The intensity ratio method is a method [remainder of sentence illegible in source]. The method is most accurate when
there is good discrimination between the fluorescence [remainder of sentence illegible in source].

The intensity ratio method will be described as being used to identify one unknown base in a sample
nucleic acid sequence. In practice, the method is used to identify many or all of the bases in a nucleic acid sequence.

The unknown base will be identified by evaluation of up to four mutation probes and a "blank" cell, which is a location
where a labeled receptor should not bind to the chip since no probe is present. For example, suppose a DNA sequence
[text illegible in source] are to be synthesized for the target sequence. A representative wild-type probe of
[sequence illegible in source] is complementary to a region of the sequence around the possible mutation. The "mutation" probes will be
the same as the wild-type probe except for a different base at the third position, as follows: 3'-TTAGA, 3'-TTCGA,
3'-TTGGA, and 3'-TTTGA.

If the fluorescently marked sample sequence is exposed to the above four mutation probes, the intensity should be
highest for the probe that binds most strongly to the sample sequence. Therefore, if the probe 3'-TTTGA shows the highest
intensity, the unknown base in the sample will generally be called an A mutation because the probes are
complementary to the sample sequence.

The mutation probes are identical to the wild-type probes except that they each contain one of the four A, C, G, or
T bases. Although one of the "mutation" probes will be identical to the wild-type probe, such
redundant probes are intentionally synthesized for quality control and design consistency.

The identity of the unknown base is preferably determined by evaluating the relative fluorescence intensities of up
to four of the mutation probes and the "blank" cell. Because each mutation probe is identifiable by its mutation base,
a mutation probe's intensity will be referred to as the "base intensity" of the mutation base.

As a simple example of the intensity ratio method, suppose a gene of interest (target) is an HIV protease gene with
the sequence 5'-ATGTGGACAGTTGTA-3' (SEQ ID NO:1). Suppose further that a sample sequence is suspected of
having a mutation from base C to base T at the underlined base position. Although hundreds of probes may be
synthesized on the chip, the complementary mutation probes synthesized
to detect a mutation in the sample sequence at the suspected mutation position may be as follows:

3 -TATC 
j 3-TCTC 

3-TGTC (wild-type) 

S'-TTTC 

The mutation probe 3'-TGTC is also the wild-type probe as it should bind most strongly with the target sequence 
o jnten ^? er ^l S 2 UenCe * hybridi2ed on the *k «»™">> ****** me following fluorescence 

3'-TATC -> 45
3'-TCTC -> 8
3'-TGTC -> 32
3'-TTTC -> 12

where the intensity is measured by the photon count detected by the scanner. The "blank" cell had a fluorescence intensity of 2. The photon counts in the examples herein are representative (not actual data) and provided for illustration purposes. The actual photon counts will vary greatly depending on the experiment parameters and the scanner.

Each mutation probe is identifiable by its mutation base, so the bases may be said to have the following intensities:

A -> 45
C -> 8
G -> 32
T -> 12

Thus, base A will be described as having an intensity of 45, which corresponds to the intensity of the mutation probe with the mutation base A.

Initially, each mutation base intensity is reduced by the background or "blank" cell intensity. This is done as follows:

A -> 45 - 2 = 43
C -> 8 - 2 = 6
G -> 32 - 2 = 30
T -> 12 - 2 = 10

Then, the base intensities are sorted in descending order of intensity. The above bases would be sorted as follows:

A -> 43
G -> 30
T -> 10
C -> 6

Next, the highest intensity base is compared to the second highest intensity base. Thus, the ratio of the intensity of base A to the intensity of base G is calculated as 43/30 = 1.4. The ratio A:G is then compared to a predetermined ratio cutoff, which is a number that specifies the ratio required to identify the unknown base. For example, if the ratio cutoff is 1.2, the ratio A:G is greater than the ratio cutoff (1.4 > 1.2) and the unknown base is called by the mutation probe containing the mutation A. As probes are complementary to the sample sequence, the sample sequence is called as having a mutation T, resulting in a called sample sequence of 5'-ATGTGGATAGTTGTA-3' (SEQ ID NO:2).

As another example, suppose everything else is the same as in the previous example except that the sorted background adjusted intensities were as follows:

C -> 42
A -> 40
G -> 10
T -> 8

The ratio of the highest intensity base to the second highest intensity base (C:A) is 1.05. Because this ratio is not greater than the ratio cutoff of 1.2, the unknown base will be called as being ambiguously one of two or more bases as follows. The second highest intensity base is then compared to the third highest base. The ratio of A:G is 4. The ratio of A:G is then compared to the ratio cutoff of 1.2. As the ratio A:G is greater than the ratio cutoff (4 > 1.2), the unknown base is called by the mutation probes containing the mutations C or A. As probes are complementary to the sample sequence, the sample sequence is called as having either a mutation G or T, resulting in a sample sequence of 5'-ATGTGGAKAGTTGTA-3' (SEQ ID NO:3) where K is the IUPAC code for G or T(U).
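The mapping from the mutation bases of the winning probes to the code assigned to the sample can be sketched as a small lookup. This is only an illustration: the function name `code_for_probe_bases` and the table layout are assumptions made here, while the letter codes themselves are the standard IUPAC nucleotide ambiguity codes.

```python
# Standard IUPAC nucleotide codes for one, two, or three bases,
# keyed by the set of bases each code denotes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}

# Watson-Crick complements: probes are complementary to the sample.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def code_for_probe_bases(probe_bases):
    """Return the IUPAC code for the complements of the given probe bases.

    `probe_bases` is any iterable of mutation bases (hypothetical helper,
    not part of the described system).
    """
    sample_bases = frozenset(COMPLEMENT[b] for b in probe_bases)
    return IUPAC[sample_bases]


# Probes with mutation bases C and A correspond to sample bases G and T.
print(code_for_probe_bases("CA"))
```

With the probes C and A from the second example, the lookup yields K, matching the called sequence above.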

The ratio cutoff in the previous examples was equal to 1.2. However, the ratio cutoff will generally need to be adjusted to produce optimal results for the specific chip design and wild-type target. Also, although the ratio cutoff used has been the same for each ratio comparison, the ratio cutoff may vary depending on whether the ratio comparisons involve the highest, second highest, third highest, etc. intensity base.

Fig. 9 illustrates the high level flow of the intensity ratio method. At step 302 the four base intensities are adjusted by subtracting the background or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative numbers in future calculations.

At step 304 the base intensities are sorted by intensity. Each base is then associated with a number from 1 to 4. The base with the highest intensity is 1, second highest 2, third highest 3, and fourth highest 4. Thus, the intensity of base 1 ≥ base 2 ≥ base 3 ≥ base 4.

At step 306 the highest intensity base (base 1) is checked to see if it has sufficient intensity to call the unknown base. The intensity is checked by determining if the intensity of base 1 is greater than a predetermined background difference cutoff. The background difference cutoff is a number that specifies the intensity a base intensity must be over the background intensity in order to correctly call the unknown base. Thus, the background adjusted base intensity must be greater than the background difference cutoff or the unknown is not callable.

If the intensity of base 1 is not greater than the background difference cutoff, the unknown base is assigned the code N (insufficient intensity) as shown at step 308. Otherwise, the ratio of the intensity of base 1 to base 2 is calculated as shown at step 310.

At step 312 the ratio of intensity of bases 1:2 is compared to the ratio cutoff. If the ratio 1:2 is greater than the ratio cutoff, the unknown base is called as the complement of the highest intensity base (base 1) as shown at step 314. Otherwise, the ratio of the intensity of base 2 to base 3 is calculated as shown at step 316.

At step 318 the ratio of intensity of bases 2:3 is compared to the ratio cutoff. If the ratio 2:3 is greater than the ratio cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest or second highest intensity bases (base 1 or 2) as shown at step 320. Otherwise, the ratio of the intensity of base 3 to base 4 is calculated as shown at step 322.

At step 324 the ratio of intensity of bases 3:4 is compared to the ratio cutoff. If the ratio 3:4 is greater than the ratio cutoff, the unknown base is called as being an ambiguity code specifying the complements of the highest, second highest, or third highest bases (base 1, 2 or 3) as shown at step 326. Otherwise, the unknown base is assigned the code X (insufficient discrimination) as shown at step 328.
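The flow of steps 302 through 328 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `intensity_ratio_call`, the default background difference cutoff of 5, and the value used as the "small positive number" are all choices made for this sketch, and a single ratio cutoff is used for every comparison.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC codes keyed by the set of bases they denote.
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}.items()}


def intensity_ratio_call(intensities, blank, ratio_cutoff=1.2,
                         background_cutoff=5.0, epsilon=1e-6):
    """Call one unknown base from four base intensities (hypothetical helper).

    `intensities` maps each mutation base to its photon count; `blank` is
    the "blank" cell intensity. The defaults are illustrative assumptions.
    """
    # Step 302: subtract background; clamp non-positive values to a small
    # positive number so later ratios are well defined.
    adjusted = {}
    for base, value in intensities.items():
        diff = value - blank
        adjusted[base] = diff if diff > 0 else epsilon

    # Step 304: rank bases from highest to lowest adjusted intensity.
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)

    # Steps 306/308: the strongest base must clear the background cutoff.
    if adjusted[ranked[0]] <= background_cutoff:
        return "N"

    # Steps 310-326: compare neighbouring ranks 1:2, 2:3, 3:4 in turn.
    for k in range(1, 4):
        ratio = adjusted[ranked[k - 1]] / adjusted[ranked[k]]
        if ratio > ratio_cutoff:
            # Call the complements of the k highest-intensity probe bases.
            called = frozenset(COMPLEMENT[b] for b in ranked[:k])
            return IUPAC[called]

    # Step 328: no ratio exceeded the cutoff.
    return "X"


print(intensity_ratio_call({"A": 45, "C": 8, "G": 32, "T": 12}, blank=2))
```

Run on the first worked example (blank intensity 2, ratio cutoff 1.2), the sketch returns T, the same call as in the text.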

The advantage of the intensity ratio method is that it is very accurate when there is good discrimination between the fluorescence intensities of hybrid matches and hybrid mismatches. However, if the base corresponding to a correct hybrid gives a lower intensity than a mismatch (e.g., as a result of cross-hybridization), incorrect identification of the base will result. For this reason, however, the method is useful for comparative assessment of hybridization quality and as an indicator of sequence-specific problem spots. For example, the intensity ratio method has been used to determine that ambiguities and miscalls tend to be very different from sequence to sequence, and reflect predominantly the composition and repetitiveness of the sequence. It has also been used to assess improvements obtained by varying hybridization conditions, sample preparation, and post-hybridization treatments (e.g., RNase treatment).

III. Reference Method

The reference method is a method of calling bases in a sample nucleic acid sequence. The reference method depends very little on discrimination between the fluorescence intensities of hybrid matches and hybrid mismatches and therefore is much less sensitive to cross-hybridization. The method compares the probe intensities of a reference sequence to the probe intensities of a sample sequence. Any significant changes are flagged as possible mutations. There are two implementations of the reference method disclosed herein.

For simplicity, the reference method will be described as being used to identify one unknown base in a sample nucleic acid sequence. In practice, the method is used to identify many or all the bases in a nucleic acid sequence.

The unknown base will be called by comparing the probe intensities of a reference sequence to the probe intensities 
of a sample sequence. Preferably, the probe intensities of the reference sequence and the sample sequence are from 
chips having the same chip wild-type. However, the reference sequence may or may not be exactly the same as the chip 
wild-type, as it may have mutations. 
The bases at the same position in the reference and sample sequences will each be associated with up to four mutation probes and a "blank" cell. The unknown base in the sample sequence is called by comparing probe intensities of the sample sequence to probe intensities of the reference sequence. For example, suppose the chip wild-type contains the sequence 5'-AGACCTTGC-3' and it is suspected that the sample has a possible mutation at the underlined base position (the fifth base), which is the unknown base that will be called by the reference method. The "mutation" probes for the sample sequence may be as follows: 3'-GAAA, 3'-GCAA, 3'-GGAA, and 3'-GTAA, where 3'-GGAA is the wild-type probe.

Suppose further that the reference sequence, which differs from the chip wild-type by one base mutation, has the sequence 5'-AGACATTGC-3' where the mutation base is underlined. The "mutation" probes for the reference sequence may be as follows: 3'-TGAAA, 3'-TGCAA, 3'-TGGAA, and 3'-TGTAA, where 3'-TGTAA is the wild-type probe since the reference sequence is known. Although generally the sample and reference sequences are tiled on chips with the same chip wild-type, this is not required, and the tiling methods do not have to be identical, as shown by the different probe lengths in the example. Thus, the unknown base will be called by comparing the "mutation" probes of the sample sequence to the "mutation" probes of the reference sequence. As before, because each mutation probe is identifiable by its mutation base, the mutation probes' intensities will be referred to as the "base intensities" of their respective mutation bases.

As a simple example of one implementation of the reference method, suppose a gene of interest (target) and a reference sequence differ from each other at a single, underlined base. The reference sequence is marked and exposed to probes on a chip with the target sequence being the chip wild-type. Suppose further that a sample sequence is suspected to have the same sequence as the target sequence (SEQ ID NO:4) except for a mutation at the underlined base position. The sample sequence is also marked and exposed to probes on a chip with the target sequence being the chip wild-type. After hybridization and scanning, the following probe intensities (not actual data) were found for the respective complementary probes:





Reference          Sample
3'-TGAC -> 12      3'-GACT -> 11
3'-TGCC -> 9       3'-GCCT -> 30
3'-TGGC -> 80      3'-GGCT -> 60
3'-TGTC -> 15      3'-GTCT -> 6

Identifying each probe by its mutation base, the probe intensities may be written as base intensities:

Reference          Sample
A -> 12            A -> 11
C -> 9             C -> 30
G -> 80            G -> 60
T -> 15            T -> 6

Thus, base A of the reference sequence will be described as having an intensity of 12, which corresponds to the intensity of the mutation probe with the mutation base A. The reference method will now be described as calling the unknown base in the sample sequence by using these intensities.

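Fig. 10 illustrates the high level flow of one implementation of the reference method. For illustration purposes, the reference method is described as filling in the columns of an analysis table. The analysis table is not required to practice the method; it is shown to aid the reader in understanding the method.

The base intensities of the reference and sample sequences are first adjusted by subtracting the background or "blank" cell intensity of the respective chip from each base intensity, here 1 for the reference and 2 for the sample:

Reference              Sample
A -> 12 - 1 = 11       A -> 11 - 2 = 9
C -> 9 - 1 = 8         C -> 30 - 2 = 28
G -> 80 - 1 = 79       G -> 60 - 2 = 58
T -> 15 - 1 = 14       T -> 6 - 2 = 4

Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative numbers in future calculations.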

Next, the position of each base of interest in the reference and sample sequences is placed in column 1 of the analysis table. Also, since the reference sequence is a known sequence, the reference wild-type base at that position is placed in column 2 of the analysis table.

The base intensity associated with the reference wild-type (column 2 of the analysis table) is then checked to see if it has sufficient intensity to call the unknown base. In this example, the reference wild-type is C. However, the base intensity associated with the wild-type is the G base intensity, which is 79 in this example. This is because the base intensities were identified by the mutation bases of their complementary probes. The G base intensity is checked by determining if its intensity is greater than a predetermined background difference cutoff. The background difference cutoff is a number that specifies the intensity the base intensities must be above the background intensity in order to correctly call the unknown base. Thus, the base intensity associated with the reference wild-type must be greater than the background difference cutoff or the unknown base is not callable.

If the background difference cutoff is 5, the base intensity associated with the reference wild-type has sufficient intensity (79 > 5), so a P (pass) is placed in column 3 of the analysis table. Otherwise, an F (fail) is placed in column 3 of the analysis table.

Next, the ratios of the base intensity associated with the reference wild-type to each of the possible bases are calculated. The ratio of the base intensity associated with the reference wild-type to itself will be 1 and the other ratios will usually be greater than 1. The base intensity associated with the reference wild-type is G, so the following ratios are calculated:

G:A -> 79 / 11 = 7.2
G:C -> 79 / 8 = 9.9
G:G -> 79 / 79 = 1.0
G:T -> 79 / 14 = 5.6

These ratios are placed in columns 4 through 7 of the analysis table, respectively.

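At step 410 the highest base intensity associated with the sample sequence is checked to see if it has sufficient intensity to call the unknown base. The intensity is checked by determining if the highest base intensity of the sample is greater than the background difference cutoff.

If the background difference cutoff is 5, the highest base intensity, which is G in this example, has sufficient intensity (58 > 5), so a P (pass) is placed in column 8 of the analysis table. Otherwise, an F (fail) is placed in column 8 of the analysis table.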

Next, the ratios of the highest base intensity of the sample to each of the possible bases are calculated. The ratio of the highest base intensity to itself will be 1 and the other ratios will usually be greater than 1. Thus, the highest base intensity is G so the following ratios are calculated:

G:A -> 58 / 9 = 6.4
G:C -> 58 / 28 = 2.3
G:G -> 58 / 58 = 1.0
G:T -> 58 / 4 = 14.5

These ratios are placed in columns 9 through 12 of the analysis table, respectively.

At step 416 if both the reference and sample sequence probes failed to have sufficient intensity to call the unknown base, meaning there is an 'F' in columns 3 and 8 of the analysis table, the unknown base is assigned the code N (insufficient intensity). An 'N' is placed in column 17 and a confidence code of 9 is placed in column 18 of the analysis table, where the confidence codes have the following meanings:



Code   Meaning
0      Probable reference wild-type
1      Probable mutation
2      Reference sufficient intensity, insufficient intensity in sample suggests possible mutation
3      Borderline differences, unknown base ambiguous
4      Sample sufficient intensity, insufficient intensity in reference to allow comparison
5-8    Currently unassigned
9      Insufficient intensity in reference and sample, no interpretation possible



The confidence codes are useful for indicating to the user the resulting analysis of the reference method.

If only the reference sequence probes failed to have sufficient intensity to call the unknown base, meaning there is an F in column 3 and a P in column 8 of the analysis table, the unknown base is assigned the code N (insufficient intensity) as shown at step 422. An 'N' is placed in column 17 and a confidence code of 4 is placed in column 18 of the analysis table.

At step 424 if only the sample sequence probes failed to have sufficient intensity to call the unknown base, meaning there is a P in column 3 and an F in column 8 of the analysis table, the unknown base is assigned the code N (insufficient intensity) as shown at step 426. An 'N' is placed in column 17 and a confidence code of 2 is placed in column 18 of the analysis table.

In this example, both the reference and sample sequence probes have sufficient intensity to call the unknown base. At step 428 the ratios of the reference ratios to the sample ratios for each base type are calculated. Thus, the ratio A:A (column 4 to column 9) is placed in column 13 of the analysis table. The ratio C:C (column 5 to column 10) is placed in column 14 of the analysis table. The ratio G:G (column 6 to column 11) is placed in column 15 of the analysis table. Lastly, the ratio T:T (column 7 to column 12) is placed in column 16 of the analysis table. These ratios are calculated as follows:

A:A -> 7.2 / 6.4 = 1.1
C:C -> 9.9 / 2.3 = 4.3
G:G -> 1.0 / 1.0 = 1.0
T:T -> 5.6 / 14.5 = 0.4

The unknown base is called by comparing these ratios of ratios to two predetermined values as follows.

If all of the ratios of ratios (columns 13 to 16 of the analysis table) are less than a predetermined lower ratio cutoff, the unknown base is assigned the code of the reference wild-type as shown at step 432. Thus, the code for the reference wild-type (as shown in column 2) would be placed in column 17 and a confidence code of 0 would be placed in column 18 of the analysis table.

At step 434 if all the ratios of ratios are less than a predetermined upper ratio cutoff, the unknown base is assigned an ambiguity code that indicates the unknown base may be any one of the bases that has a complementary ratio of ratios greater than the lower ratio cutoff and less than the upper ratio cutoff as shown at step 436. Thus, if the ratios of ratios for A:A, C:C and G:G are all greater than the lower ratio cutoff and less than the upper ratio cutoff, the unknown base would be assigned the code B (meaning "not A"). This is because the ratios of ratios are complementary to their respective base as follows:

A:A -> T
C:C -> G
G:G -> C

Thus, the unknown base may be either C, G, or T, which is identified by the IUPAC code B. This ambiguity code would be placed in column 17 and a confidence code of 3 would be placed in column 18 of the analysis table.

At step 438 at least one of the ratios of ratios is greater than the upper ratio cutoff and the unknown base is called as the base complementary to the highest ratio of ratios. The code for the base complementary to the highest ratio of ratios would be placed in column 17 and a confidence code of 1 would be placed in column 18 of the analysis table.

In this example, the ratios of ratios are as follows:

A:A -> 1.1
C:C -> 4.3
G:G -> 1.0
T:T -> 0.4

As at least one of the ratios of ratios is greater than the upper ratio cutoff, the unknown base is the base complementary to the highest ratio of ratios. The highest ratio of ratios is C:C, which has a complementary base G. Thus, the unknown base is called G, which is placed in column 17, and a confidence code of 1 is placed in column 18 of the analysis table.

In this example, the unknown base in the sample nucleic acid sequence was correctly called as base G. Although the complementary "mutation" probe associated with the base G (3'-GCCT) did not have the highest fluorescence intensity, the unknown base was called as base G because the associated "mutation" probe had the highest ratio increase over the other "mutation" probes.
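The ratio-of-ratios comparison of this first implementation can be sketched in a few lines. The sketch is illustrative only: the function names, the cutoff values 1.5 and 3.0 used in the usage line, and the handling of a ratio exactly equal to the lower cutoff are assumptions made here, not values taken from the specification. Ratios are computed from the unrounded intensities, so intermediate values can differ slightly from the rounded figures in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC codes keyed by base set; the four-base set is included
# only so that the lookup cannot fail (an assumption of this sketch).
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}.items()}


def subtract_background(intensities, blank, epsilon=1e-6):
    """Subtract the blank cell intensity, clamping to a small positive value."""
    return {b: (v - blank if v - blank > 0 else epsilon)
            for b, v in intensities.items()}


def reference_call(ref, ref_blank, sample, sample_blank, wt_probe_base,
                   lower_cutoff, upper_cutoff, background_cutoff=5.0):
    """Return (called code, confidence code) for one position.

    `wt_probe_base` is the mutation base of the reference wild-type probe
    (G in the worked example, where the reference wild-type is C).
    """
    ref_adj = subtract_background(ref, ref_blank)
    sam_adj = subtract_background(sample, sample_blank)

    # Intensity checks (columns 3 and 8 of the analysis table).
    ref_ok = ref_adj[wt_probe_base] > background_cutoff
    top = max(sam_adj, key=sam_adj.get)
    sam_ok = sam_adj[top] > background_cutoff
    if not ref_ok and not sam_ok:
        return "N", 9
    if not ref_ok:
        return "N", 4
    if not sam_ok:
        return "N", 2

    # Columns 4-7 and 9-12, then the ratios of ratios (columns 13-16).
    ref_ratios = {b: ref_adj[wt_probe_base] / ref_adj[b] for b in "ACGT"}
    sam_ratios = {b: sam_adj[top] / sam_adj[b] for b in "ACGT"}
    ror = {b: ref_ratios[b] / sam_ratios[b] for b in "ACGT"}

    if all(r < lower_cutoff for r in ror.values()):
        return COMPLEMENT[wt_probe_base], 0   # probable reference wild-type
    if all(r < upper_cutoff for r in ror.values()):
        # Ties at the lower cutoff count as candidates (assumption) so the
        # candidate set is never empty.
        bases = frozenset(COMPLEMENT[b] for b, r in ror.items()
                          if r >= lower_cutoff)
        return IUPAC[bases], 3                # borderline, ambiguous
    best = max(ror, key=ror.get)
    return COMPLEMENT[best], 1                # probable mutation


ref = {"A": 12, "C": 9, "G": 80, "T": 15}
sample = {"A": 11, "C": 30, "G": 60, "T": 6}
print(reference_call(ref, 1, sample, 2, "G", 1.5, 3.0))
```

For the worked example the C:C ratio of ratios is the largest, so the sketch returns the base G with confidence code 1, in agreement with the analysis table.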

Fig. 11 illustrates the high level flow of another implementation of the reference method. As in the previous implementation, this implementation also compares the probe intensities of a reference sequence to the probe intensities of a sample sequence. However, this implementation differs conceptually from the previous implementation in that neighboring probe intensities are also analyzed, resulting in more accurate base calling.

Suppose a reference sequence (SEQ ID NO:6) and a sample sequence are the same except at one position, where the mutant base is underlined and there is a mutation of A to G. Suppose further that the reference and sample sequences are tiled on chips with the reference sequence being the chip wild-type. This implementation of the reference method will be described as identifying this mutation base.

For illustration purposes, this implementation of the reference method is described as filling in a data table (SEQ ID NO:28, SEQ ID NO:29), although the data table contains more data than is required. The portions of the data table produced by steps in Fig. 11A are shown with the same reference numerals. No data table is necessary, however; it is shown to aid the reader in understanding the method.

The mutant base is at position 241 in the reference and sample sequences.

At step 502 the base intensities of the reference and sample sequences are adjusted by subtracting the background or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative numbers. In the data table, data 502A is the background subtracted base intensities for the reference sequence and data 502B is the background subtracted base intensities for the sample sequence (also called the "mutant" sequence in the data table).
The base intensity associated with the reference wild-type is then checked to see if it has sufficient intensity to call the unknown base. In this example, the reference wild-type is base A at position 241. The base intensity associated with the reference wild-type is identified by a lower case "a" in the left hand column. Thus, the base intensities in the data table are not identified by their complements and the reference wild-type at the mutation position has an intensity of 385. The reference wild-type intensity of 385 is checked by determining if its intensity is greater than a predetermined background difference cutoff. The background difference cutoff is a number that specifies the intensity the base intensities must be above the background intensity in order to correctly call the unknown base. Thus, the base intensity associated with the reference wild-type must be greater than the background difference cutoff or the unknown base is not callable.

If the base intensity associated with the reference wild-type is not greater than the background difference cutoff, the wild-type sequence fails to have sufficient intensity. Otherwise, at step 508 the wild-type sequence passes by having sufficient intensity.

Calculations are then performed on the background subtracted base intensities of the reference sequence in order to "normalize" the intensities. Each position in the reference sequence has four background subtracted base intensities associated with it. The ratios of the intensity of each base to the sum of the intensities of the possible bases (all four) are calculated, resulting in four ratios, one for each base, as follows:

A ratio = A/(A + C + G + T)
C ratio = C/(A + C + G + T)
G ratio = G/(A + C + G + T)
T ratio = T/(A + C + G + T)

These ratios are generally calculated in order to "normalize" the intensity data, as the photon counts may vary widely from experiment to experiment. Thus, the ratios provide a way of comparing intensities across experiments.

The highest base intensity of the sample sequence is likewise checked to determine whether it is greater than the background difference cutoff.

Calculations are also performed on the background subtracted base intensities of the sample sequence in order to "normalize" the intensities. Each position in the sample sequence has four background subtracted base intensities associated with it. The ratios of the intensity of each base to the sum of the intensities of the possible bases (all four) are calculated, resulting in four ratios, one for each base as shown in the data table.

If either the reference or the sample sequence failed to have sufficient intensity, the unknown base is assigned the code N (insufficient intensity) as shown at step 522.

At step 524 the normalized base intensities of the reference sequence are subtracted from the normalized base intensities of the sample sequence. Thus, at each position the following calculations are performed:

A Difference = Sample A Ratio - Reference A Ratio 

C Difference = Sample C Ratio - Reference C Ratio 

G Difference = Sample G Ratio - Reference G Ratio 

T Difference = Sample T Ratio - Reference T Ratio 
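The normalization and subtraction at a single position can be sketched as below. The function names and the intensities in the usage lines are hypothetical values chosen for illustration; they are not taken from the data table.

```python
def normalize(intensities):
    """Divide each background subtracted intensity by the sum of all four."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}


def base_differences(reference, sample):
    """Sample ratio minus reference ratio for each base at one position."""
    ref_ratios = normalize(reference)
    sample_ratios = normalize(sample)
    return {base: sample_ratios[base] - ref_ratios[base] for base in "ACGT"}


# Hypothetical background subtracted intensities at one position.
reference = {"A": 400, "C": 40, "G": 60, "T": 30}
sample = {"A": 90, "C": 45, "G": 570, "T": 50}

for base, diff in base_differences(reference, sample).items():
    print(f"{base} Difference = {diff:+.2f}")
```

Because each set of four ratios sums to one, the four differences always sum to zero: a large positive difference for one base must be offset by negative differences elsewhere.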

At step 526 each position is checked to see if there is a base difference greater than an upper difference cutoff and a base difference lower than a lower difference cutoff. For example, Fig. 11C shows a plot of the normalized sample base intensities minus the normalized reference base intensities. Suppose that the upper difference cutoff is 0.15 and the lower difference cutoff is -0.15, as shown by the horizontal lines in Fig. 11C. At the mutation position, one base difference is 0.28, which is greater than 0.15, the upper difference cutoff. Similarly, another base difference is less than the lower difference cutoff. As there is a base difference above the upper difference cutoff and a base difference below the lower difference cutoff, there may be a mutation at this position.

If not, the base at the position is assigned the code of the reference wild-type base as shown at step 528.

The highest background subtracted base intensity in the sample is 571 (base G). The background subtracted reference wild-type base intensity is 385 (base A). The ratio of 571:385 is calculated and results in 1.48.






EP 0 717 113 A2 



at position 242 (which Si aa ZV^X J. T^' *• mutet,0n P 08 * 00 241in,h8 teUe. the ratio 
seajjeree^AtoG) ^ ' 8,9 Sequence 18 caI,ed 8 tase G - «•** * • Tom the reference 

IV. Statistical Method



Reference      Sample
3'-TGAC        3'-GACT
3'-TGCC        3'-GCCT
3'-TGGC        3'-GGCT
3'-TGTC        3'-GTCT



The "mutation" probes shown for the reference sequence may be from only one experiment; the other experiments may have different "mutation" probes, chip wild-types, tiling methods, and the like. Although each fluorescence intensity is from a probe, since the probes may be identified by their unique mutation bases, the probe intensities may be identified by their respective bases as follows:



Reference          Sample
3'-TGAC -> A       3'-GACT -> A
3'-TGCC -> C       3'-GCCT -> C
3'-TGGC -> G       3'-GGCT -> G
3'-TGTC -> T       3'-GTCT -> T



Thus, base A of the reference sequence will be described as having an intensity which corresponds to the intensity of the mutation probe with the mutation base A. The statistical method will now be described as calling the unknown base in the sample sequence by using this example.

Fig. 12 illustrates the high level flow of the statistical method. At step 602 the four base intensities associated with the sample sequence and each of the multiple reference experiments are adjusted by subtracting the background or "blank" cell intensity from each base intensity. Preferably, if a base intensity is then less than or equal to zero, the base intensity is set equal to a small positive number to prevent division by zero or negative numbers.

At step 604 the intensities of the reference wild-type bases in the multiple experiments are checked to see if they all have sufficient intensity to call the unknown base. The intensities are checked by determining if the intensity of the reference wild-type base of an experiment is greater than a predetermined background difference cutoff. The wild-type probe shown earlier for the reference sequence is 3'-TGGC, and thus the G base intensity is the wild-type base intensity. These steps are analogous to steps in the other two methods described herein.

If the intensity of any one of the reference wild-type bases is not greater than the background difference cutoff, the 
wild-type experiments fail to have sufficient intensity as shown at step 606. Otherwise, at step 608 the wild-type exper- 
iments pass by having sufficient intensity. 

At step 610 calculations are performed on the background subtracted base intensities of each of the reference experiments in order to "normalize" the intensities. Each reference experiment has four background subtracted base intensities associated with it: one wild-type and three for the other possible bases. In this example, the G base intensity is the wild-type, the A, C, and T base intensities being the "other" intensities. The ratios of the intensity of each base to the sum of the intensities of the possible bases (all four) are calculated, giving one wild-type ratio and three "other" ratios. Thus, the following ratios would be calculated:

A ratio = A/(A + C + G + T)
C ratio = C/(A + C + G + T)
G ratio = G/(A + C + G + T)
T ratio = T/(A + C + G + T)

where G ratio is the wild-type ratio and A, C, and T ratios are the "other" ratios. These four ratios are calculated for each reference experiment. Thus, if the number of reference experiments is n, there would be 4n ratios calculated. These ratios are generally calculated in order to "normalize" the intensity data, as the photon counts may vary widely from experiment to experiment. However, if the probe intensities do not vary widely from experiment to experiment, the probe intensities do not need to be "normalized."

At step 612 statistics are prepared for the ratios calculated for each of the reference experiments. As stated before, each reference experiment will be associated with one wild-type ratio and three "other" ratios. The mean and standard deviation are calculated for all the wild-type ratios. The mean and standard deviation are also calculated for each of the other ratios, resulting in three other means and standard deviations for each of the bases that is not the wild-type base. Therefore, the following would be calculated:



Mean and standard deviation of A ratios
Mean and standard deviation of C ratios
Mean and standard deviation of G ratios
Mean and standard deviation of T ratios

where the mean and standard deviation of the G ratios are also known as the wild-type mean and the wild-type standard deviation, respectively. The means and standard deviations of the A, C, and T ratios are also known collectively as the "other" means and standard deviations.

Suppose that the preceding calculations produced the following data:

A ratios -> mean = 0.16, std. dev. = 0.003
C ratios -> mean = 0.03, std. dev. = 0.002
G ratios -> mean = 0.71, std. dev. = 0.050
T ratios -> mean = 0.11, std. dev. = 0.004

In one embodiment, the steps up to this point are performed in a preprocessing stage for the wild-type experiments. The results of the preprocessing stage are stored in a file so that the reference calculations do not have to be repeatedly calculated, improving performance.
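The preprocessing of the reference experiments (steps 610 and 612) might be sketched as follows. The function names and the three sets of intensities are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether a sample or population estimate is intended.

```python
from statistics import mean, stdev


def normalize(intensities):
    """Ratio of each base intensity to the sum of all four (step 610)."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}


def reference_statistics(experiments):
    """Mean and standard deviation of each base ratio (step 612).

    `experiments` is a list of dicts of background subtracted intensities,
    one dict per reference experiment; at least two are needed for stdev.
    """
    ratios = [normalize(e) for e in experiments]
    stats = {}
    for base in "ACGT":
        values = [r[base] for r in ratios]
        stats[base] = (mean(values), stdev(values))
    return stats


# Hypothetical background subtracted intensities from three experiments.
experiments = [
    {"A": 160, "C": 30, "G": 700, "T": 110},
    {"A": 150, "C": 28, "G": 720, "T": 102},
    {"A": 170, "C": 33, "G": 690, "T": 107},
]
for base, (m, s) in reference_statistics(experiments).items():
    print(f"{base} ratios -> mean = {m:.2f}, std. dev. = {s:.3f}")
```

The resulting dictionary is small enough to be written to a file once and reloaded for each sample, which is the kind of reuse the preprocessing stage is meant to allow.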

At step 614 the highest base intensity associated with the sample sequence is checked to see if it has sufficient intensity to call the unknown base. The intensity is checked by determining if the highest intensity unknown base is greater than the background difference cutoff. If the intensity is not greater than the background difference cutoff, the sample sequence fails to have sufficient intensity as shown at step 616. Otherwise, at step 618 the sample sequence passes by having sufficient intensity.

At step 620 calculations are performed on the four background subtracted intensities of the sample sequence. The ratios of the background subtracted intensity of each base to the sum of the background subtracted intensities of the possible bases (all four) are calculated, giving four ratios, one for each base. For consistency, the ratio associated with the reference wild-type base is called the wild-type ratio, with there being three "other" ratios. Thus, the following ratios are calculated:

A ratio = A/(A + C + G + T)
C ratio = C/(A + C + G + T)
G ratio = G/(A + C + G + T)
T ratio = T/(A + C + G + T)

where ratio G is the wild-type ratio and ratios A, C, and T are the "other" ratios.

Suppose the background subtracted intensities associated with the sample are as follows:

A -> 310
C -> 50
G -> 26
T -> 100

Then, the corresponding ratios would be as follows:

A ratio = 310 / (310 + 50 + 26 + 100) = 0.64
C ratio = 50 / (310 + 50 + 26 + 100) = 0.10
G ratio = 26 / (310 + 50 + 26 + 100) = 0.05
T ratio = 100 / (310 + 50 + 26 + 100) = 0.21

At step 622 if either the reference experiments or the sample sequence failed to have sufficient intensity, the unknown base is assigned the code N (insufficient intensity) as shown at step 624.

At step 626 the wild-type and "other" ratios associated with the sample sequence are compared to statistical expressions. The statistical expressions include four predetermined standard deviation cutoffs, one associated with each base. Suppose the standard deviation cutoffs are as follows:

A standard deviation cutoff -> 4
C standard deviation cutoff -> 2
G standard deviation cutoff -> 8
T standard deviation cutoff -> 4

The wild-type base ratio associated with the sample is compared to a corresponding statistical expression:

WT ratio ≥ WT mean - (WT std. dev. * WT base std. dev. cutoff)

0.05 ≥ 0.71 - (0.050 * 8)
0.05 ≥ 0.31

which is not a true expression (0.05 is not greater than or equal to 0.31).

Each of the "other" ratios associated with the sample is compared to a corresponding statistical expression:

Other ratio > Other mean + (Other std. dev. * Other base std. dev. cutoff)

A -> 0.64 > 0.16 + (0.003 * 4)
     0.64 > 0.17

C -> 0.10 > 0.03 + (0.002 * 2)
     0.10 > 0.03

T -> 0.21 > 0.11 + (0.004 * 4)
     0.21 > 0.13

which are all true expressions.

As the wild-type expression was not true and the "other" ratios for A, C, and T were all greater than their corresponding cutoff expressions, the unknown base would be called as the complements of these bases, represented by the subset T, G, and A. Thus, the unknown base would be assigned the code D (meaning "not C").
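The comparison at step 626 can be sketched with the numbers from the example. The function names are hypothetical, and the sketch only covers the case worked through in the text, where the wild-type expression fails and the call is built from the "other" bases whose expressions hold; how other combinations are resolved is not assumed here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC codes keyed by the set of bases they denote.
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}.items()}


def evaluate_expressions(sample_ratios, ref_stats, sd_cutoffs, wt_base):
    """Evaluate the wild-type and "other" statistical expressions.

    Returns (wild-type expression holds, list of "other" bases whose
    expressions hold). `ref_stats` maps base -> (mean, std. dev.).
    """
    wt_mean, wt_sd = ref_stats[wt_base]
    wt_holds = sample_ratios[wt_base] >= wt_mean - wt_sd * sd_cutoffs[wt_base]
    elevated = [
        b for b in "ACGT"
        if b != wt_base
        and sample_ratios[b] > ref_stats[b][0] + ref_stats[b][1] * sd_cutoffs[b]
    ]
    return wt_holds, elevated


def code_for_elevated(elevated):
    """IUPAC code for the complements of the elevated probe bases."""
    return IUPAC[frozenset(COMPLEMENT[b] for b in elevated)]


sample_ratios = {"A": 0.64, "C": 0.10, "G": 0.05, "T": 0.21}
ref_stats = {"A": (0.16, 0.003), "C": (0.03, 0.002),
             "G": (0.71, 0.050), "T": (0.11, 0.004)}
sd_cutoffs = {"A": 4, "C": 2, "G": 8, "T": 4}

wt_holds, elevated = evaluate_expressions(sample_ratios, ref_stats,
                                          sd_cutoffs, wt_base="G")
if not wt_holds and elevated:
    print(code_for_elevated(elevated))
```

With the example values the wild-type expression fails and all three "other" expressions hold, so the sketch prints the code D.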

" e*pel\~;^ « — data from muKpfe reference 

and calling of mixed sequel alS ° b8enusedto Moment confidence estimates 
V. Pooling Processing



* heran. the referee and W^S^^l TUTT '"^'^^^imentdiscuseed 
differem^aveleng**. ^X^Z^^TIT^^T^^^^ 
radioactive markers. * other lypes «* "afters including distinguishable 

detect, ngtargelslabeledv«LiH«ertn^^sSJrn i <s A^f f If Pf ° MSSed t09ether ' «PP«" «* 
by reference tor all purposes. " U& App " Cat,on 08/195.889 and is hereby incorporated 

nud^s^S 

is marked with, dye that. uZZ^^^^T^'Z^ 704 ^ sample nucleic acid sequence 

* reference sequence. rTex^e £^Tn^S IT ** " ^ ,lu0rescent <* e * 1,18 
sample nucleic add secmeZmay be l^ZTZr^lT^ ^ ** ""to** AJternaBvely, the 
to streptavtfn labeled^ ^^"^^ S^TT "** Wi " ^uerT/bind 
kinds of markers (e.g.. radioactive) aTtono^h^tT^ "** be martod *** or other dyes or other 

Wsteproetri.aUaS™ 
« ^ntnuesinmesanUnTaVT^^^ 
«~ac*s^encesa,e,h^ 

c°"edion. the date fteL^ 
« focusing excitation light on f^rSEto^XS d^JTfh « """^ "* Seanner flenera,es im *9e file by 
the tarescent Dght can beZll^^?!?^ " 9h1 •* fe T*» ™*er errfttir^ 

is about 530 nmLethmo,^^^^ 

^Z:Z^^^Z^1ZT^ "~ l » **-"» the locals 

sequences by utBiangm^^^^ 

5 n-^r^Zse^^^ 

and removing false positives for mutated wne£ Sc^t^f .T^"*' ***** muWple ^ 
mutation. These methods utilize mmmm froTJ^Z ^ *" 6rr0neousty been "o"**' as a 
mutations. The interrogation posifon on STSS 008 p0ation to WenS,y *• position of 

> more efliden. use J£2 ca^e^Ss t^lT TT* **** nMaBons *** 
^.^hereintoefrlC^^^ 

arr^rar^rir^r^ 
'^ecausethecells*^^ 

to the sample sequence Thus therrvhi^nZn^iTS^ ? otco ^P^'hatarepertectrycoinpleinentary 
o^rk^^ 

caseTucS^Svr^^ 

is identical to the re1erer«n2u^«c^^m^2^ * en ~^^ ^ to °wn and the sarrple sequence 
s^erKeswerethenpro^^to^ T ^ 6amp,e •* retere "<» 

tr»wild-rypece.,rZ!^^^ 

rype ceils numbered 1 1. 24, 39. etc. are 1 . The low photon counts are due to the fad that there are no 
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* known. Consequently, the .rrtermrtterrt wild-type cells that have a photon count of 1 do not represent erroneous 

^^r^Tj",^!!!* - 816 SCaled Ph ° t0n Counte «* 9,6 """^ P"*es hybridizing with the sample and reference 
*2? ^^•^ ** tw " bubb,es -" * bubble 730 has a top curve defined byTe photon counte 

I Z^ rtt m9 Sample 8e « uenc * ****•» bubble 730. there is iJEXEZ 

After section 732 is another bubble 734 which again has a top curve defined by the hybridization of the reference

roTe™;^^ 

Each bubble in Fig. 14A corresponds to a region surrounding a single mutation. Because the wild-type probes in this region have a mismatch with the sample sequence, their hybridization with the sample sequence is relatively lower, which results in lower photon counts. Much information about the sample sequence may be acquired by a detailed analysis of these bubble regions.
The width of a bubble indicates whether there is a false positive, a single mutation or a multiple mutation. If there is a single mutation, the width of the bubble should be approximately equal to the probe length. Fig. 14A was produced utilizing 20-mer probes. Accordingly, bubbles 730 and 734 are approximately 20 wild-type cells wide, which indicates that they were produced by single mutations. The width of such a bubble is approximately equal to the probe length because each of the probes in this region has a single base mismatch with the sample sequence.

If the width of the bubble is substantially less than the probe length, the bubble may represent a false positive. For example, assume that at wild-type cell number 45 in Fig. 14A, the hybridization of the wild-type probe with the sample sequence is low (e.g., 1000 photon counts). A base calling method may call the base according to the probe intensities and indicate that there is a mutation at this position. However, the low photon counts may be due to dust on the chip and not due to lower hybridization. Since the width of this bubble would be substantially less than the probe length of 20, the low photon count at wild-type cell 45 would not be due to a mutation (i.e., there is no dark region surrounding that position).

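If the width of the bubble is substantially greater than the probe length, the bubble may represent multiple mutations that produce one overlapping dark region. The analysis of such a bubble will be discussed in more detail in reference to Fig. 14C.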

T" 01 I* T*" 18 " tthin * e M mentol « 1 the 20-mer probes orTthe chto haZ 

interrogation position at the 12th base position in the probe. Thus, the base at the 12th base position is the base that is interrogated.

The actual mutation in bubble 730 occurs at the 12th position (from the left). Additionally, the actual mutation in bubble 734 occurs at the 12th position (from the left). Thus, as the graph shows, there are 11 bases to the left of each mutation and 8 bases to the right of each mutation.
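
The relationship between the interrogation position and the location of a single mutation inside a bubble can be written down directly. The Python sketch below assumes 1-based positions counted from the left edge of the bubble, as in the description of bubbles 730 and 734. The function names and the starting cell number 100 are hypothetical.

```python
def flanking_bases(probe_length, interrogation_position):
    """Bases to the left and right of the interrogation position (1-based)."""
    if not 1 <= interrogation_position <= probe_length:
        raise ValueError("interrogation position must lie within the probe")
    left = interrogation_position - 1
    right = probe_length - interrogation_position
    return left, right


def likely_mutation_cell(bubble_start, interrogation_position):
    """Estimate the wild-type cell of a single mutation from the bubble's first cell."""
    return bubble_start + interrogation_position - 1


left, right = flanking_bases(20, 12)
cell = likely_mutation_cell(100, 12)   # hypothetical bubble starting at cell 100
print(left, right, cell)
```

For 20-mer probes with the interrogation position at the 12th base, this gives 11 bases to the left and 8 to the right, matching the counts read off the graph.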

the present (nventon can heto to identify the probable location ol a mutation within a dark ^^bubSe 
.^2,^ melhod * 8nWies locations that may have a mutation, more efficient base calling 

^.vLTJ^ t e T T !?l an °' ****** 730 that *« * to be a single mutat^S 

oSinl dJ^™ T !" baS !^ B " 9 ° CCUr in ,h9 <** "**°<* surrounding a nLson. Man^se 
P^toes n th* dark zone can now be ehminated because they are incompatible with the bubble size (which indicates 

l7^T££T* ey f?' ^ ^"Odearty a "mismatch zone," we can now apply algorithms that factoM^ 
the effect of a mismatch or multiple mismatches. 

of th^?f^!.f^ °" ha may indicate what mutation has occurred. Rg. 14B shows a hypothetical graph 
ofthe tomri .rrtensrt.es vs. cell locations for wild-type probes hybridizing with two sample sequences anZe 
h^n^H^T 6 - ^ mSmatCh 66 ^ destaWizi "9 '° hybrklizadon than a U^3 mismatch. As^n 
the parteular mutation by pattern matching bubbles stored in a library. 

Fig. 14C shows a graph of the fluorescent intensities (photon counts) of the wild-type probes hybridizing with the

VlT^SSi nTheflraphWaS P rod ^ frOTa *P'^'"920-mer probes wimanimerrogato 

Bubble 750 is 27 bases wide, indicating that the bubble was produced from the dark regions surrounding multiple mutations, because 27 is greater than 20, the length of the probes. In addition to providing information that there are multiple mutations, analysis of the bubble indicates the probable position of two of the mutations. Because the
After the hybridization intensities for the sample and reference sequence are measured, the system identifies a bubble region at step 780. Bubble regions are identified as regions where the hybridization of the wild-type probes with the sample sequence is less than their hybridization with the reference sequence.

JSS5S! ^^^ rat0nrih ~ d ^* 9b ^^^«^tomepr*e length ™ y vary a^ 

If the width of the bubble is substantially greater than the probe length, the bubble represents multiple mutations at step 790.
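
The bubble analysis described in this section can be illustrated with a short Python sketch. It is not the patented implementation. The `drop` threshold used to decide that a wild-type cell is dark and the `tolerance` used to interpret "approximately" and "substantially" are assumptions, since the text gives no numeric values for either, and the function names are hypothetical.

```python
def find_bubbles(sample, reference, drop=0.5):
    """Return inclusive (start, end) cell ranges where the sample intensity of
    the wild-type probe falls below `drop` times the reference intensity."""
    pairs = list(zip(sample, reference))
    bubbles, start = [], None
    for i, (s, r) in enumerate(pairs):
        dark = s < drop * r
        if dark and start is None:
            start = i                      # a dark run begins
        elif not dark and start is not None:
            bubbles.append((start, i - 1)) # the dark run ended at the previous cell
            start = None
    if start is not None:
        bubbles.append((start, len(pairs) - 1))
    return bubbles


def classify_bubble(width, probe_length, tolerance=0.2):
    """Classify a bubble by comparing its width with the probe length."""
    if width < probe_length * (1 - tolerance):
        return "false positive"
    if width > probe_length * (1 + tolerance):
        return "multiple mutations"
    return "single mutation"


# Synthetic intensities: one 20-cell bubble and one isolated dark cell.
reference = [1000] * 60
sample = list(reference)
sample[10:30] = [200] * 20
sample[45] = 100

bubbles = find_bubbles(sample, reference)
labels = [classify_bubble(end - start + 1, probe_length=20) for start, end in bubbles]
print(list(zip(bubbles, labels)))
```

In this synthetic data the 20-cell run is labeled a single mutation and the isolated dark cell a false positive. A 27-cell bubble, as in Fig. 14C, would be labeled as multiple mutations under the same tolerance.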

The system performs base calling at likely locations of mutations at step 792. The likely locations of mutations are determined by both the width of the bubble and the location of the interrogation position. A base calling method may then be used to determine the specific mutation within the bubble.

The system then produces confidences that the mutations are identified correctly. Each confidence is determined by how closely the experimental data matched the data expected for the mutation that was called. For example,
£££2? !^ aS r Cfl1 ' 818 Mme 88 816 probe and the base calling method ideriied muMo^We 

«rr,s^^^ 

htSt^d^fr^rr^ I^ 868 m6ttl0dS ^ a^S^y performed wrth pooling processing and utilize 
T^T^T^rZ 0,16 6356 P 051 *" 1 ,0 «an«y the likely position of mutations. The irterrogaUon poslt^ 



of base calling methods. 
VI. Comparative Analysis



meth^^iTl^^r 1 ^ m6th0d <* and tfsuaJization of multiple experiments The 

The rrfLZI 8 ^ 0 " Wind0W iS Ctvid9d into a reterence 8 «l u " 1 '=e area 81 4 and a sample sequence area 816 
w'^ZJSZ™ 1" *«™i>™™ SequereeS 3,6 and is divided into a^ere^ceZe stterw 

S^andrefer^tasesubareaSZO.Thereferenceramesubareaissr^ 

chip. The reference base subarea contains the bases of the reference sequences. A capital C 822 is displayed to the
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h^a^CeS 

sequences that have not £££££ IT" ^ *° "* ™V be Known 

reference sequences other than the ItowW .L a,?l^^ assoaated fluorescence intensities. The 

5 ASCII text files. ™n <ne cnp w,« type are used for sequence comparisons and may be in the form of simple 

w^e^Z^^ 6 «™ « comparison 

subarea 826. The «JS^ln«2S! itS JTn ^ a ^ e ^ ^8 884 and sarrple base 
extensions indfc^etheCnXXSl ^ *" Sample set " J8TO « i - ™° «ename 

.0 den°lesmereferer^rS 

^^ea contains ttebtsfoftfTs^ 

?* ***• *» ^ ^PaTcc^ tole^FAC I^' e ^ * 

a> before, these knoTnaere^q^ ™ v ™ZT ^ Sa " UenCe6 ** mn ^ ieo " As 
menu the user is ^^^2^ Srf oX^T"^ **■ ^fr' 0 "* 11 * in Ws 

later time. The project file also eX^TtotaT of * a ^. TT" 8 ^ '" 8 ^ 080 be loaded in * 8 
purposes. Sequences to be ^SZ^2^Zl^Tlr^ m linked ** 

ir^ut device in the reference or WrZT^Tl * S6 ' 6t * n9 SeqUSnCa ,flenamB "« h a " 

* areas 1 J^«^^^^« * - and sanple sequence 

linked to the reference seque^^S e^n th.^, ^T** "* Sample 6e * ences <*" «» 

probes associated *Ta b3^ o^t n^St 1 9raph ShOWS lha in,6nsities each of the hybridized 

No2 JS S MrT^^ WM0WS ** Sel8c,ed tose * (SEQ ID NO^SEQ^ N033 SEQ ID 

ft » has associated probe irter^tydate ,as * T^TJ^T axpefime "' ""V «*> either a reference 

a rectangubr box 1M 8 shoS sTectSes^^Tia " Mquence / Tbe *• Signed «■ 

rectangular box aids the user in tfentSe^tXTes 6 S8qUenCeS (P0Sifon 162 R * ,8 >' 

andsa^AtTeS^ 
wiwWsequer^^ 

"quenc^refen^Lu^^ to any linked reference 

and sarrple sequenced 2S^22^ST£^ ** after »• u "' '**<* a reference 

areSmflZeftere s^ue^ Lnt o^:7^' **" "» W 

inadrfferent color). Inanotherex^eTeulSt^^.! ^ ^ tothauser <•*• base is shown 
After a sarrple ,s linked to ISer^Ts^u^rea^ b^^^^'"" 1 he * We ^ 
^"-nce**^^ 
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NO:17. and SEQ ID N0 18) A window 102^ 2 10 J** 14 - SEC > 10 NCM5, SEQ ID NO:16. SEQ ID 
muton, sequence dm^ll J^^w^^n * r*"' ^ U ^ e ,106 ' 

^^sequence^area^S 

sanplesotutionsmo. m 2> ?i^°v^ Tne * e T* 5equence ^ «*• *• 





Sample Solution    Chip Wild-Type:Mutant
1110               100:0
1112               90:10
1114               75:25
1116               50:50
1118               25:75
1120               10:90
1122               0:100



F< " SSll* ""contains 75% chip wild-type sequence and 25% mutant sequence 

1 1 1 0 is 1 00% chin wiiri as ,n me ^'P-w W type sequence. Thts is correct because sample solution 

75% to 50%chp-wild type sequenceand tarn 25%io.^™^ l ^= ause * e samD ' e solutions have from 

calls the base inWs ti*ta££ °" S8qU<mCe ^ intenrity rafe method ™<**i 

o^rr:^*^ 

in «?Sut ^st^e^XfSTr 06 *" "*° method ^ owever - the base 

ratio nSS l Sc^ "!k"T! * 8 A * «»« ^oa The intensity 

the other p^SeT ^ base as C because the probe .ntensrty associated with base C is much higher^ 

^^^^^ITZ^^t^: *e referee sequence is 

the position indicated by the recta^uterb^ 71,6 referenoe se * ,ence * called a base C at 
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Example 1 



Sequence analysis of various polymorphic HIV-1 clones was performed using a protease chip. Single stranded DNA of a 382 nt region was used with 4 different clones (HXB2, SF2, NY5, pPol4mut18). Results were compared to results from an ABI sequencer. The results are illustrated below:





              ABI                   Protease Chip
              Sense    Antisense    Sense    Antisense
No call       0        4            9        4
Ambiguous     6        14           17       8
Wrong call    2        3            3        1
TOTAL         8        21           29       13


SUMMARY 
ABI (sense) - 99.5% 
Chip (sense) - 98.1% 
ABI (antisense) - 98.6% 
Chip (antisense) - 99.1% 
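
The summary percentages can be reproduced from the totals in the table. The text does not state how they were computed, so the following Python check makes an explicit assumption: the denominator is the 382 nt region multiplied by the 4 clones (1528 bases), and accuracy is one minus the fraction of no-call, ambiguous and wrong-call bases.

```python
REGION_LENGTH = 382   # nt, from the text
CLONES = 4            # HXB2, SF2, NY5, pPol4mut18

# Error counts per column: (no call, ambiguous, wrong call).
errors = {
    "ABI (sense)": (0, 6, 2),
    "ABI (antisense)": (4, 14, 3),
    "Chip (sense)": (9, 17, 3),
    "Chip (antisense)": (4, 8, 1),
}

bases_called = REGION_LENGTH * CLONES  # assumed denominator, not stated in the text
accuracy = {
    name: round(100 * (1 - sum(counts) / bases_called), 1)
    for name, counts in errors.items()
}
for name, pct in accuracy.items():
    print(f"{name}: {sum(errors[name])} errors, {pct}%")
```

Under that assumption the computed values agree with the four summary figures, which suggests that the totals and the percentages are consistent with one another.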



Example 2 



HIV protease genotyping was performed using the described chips and CallSeq™ intensity ratio calculations. Samples were evaluated from AIDS patients before and after ddI treatment. Results were confirmed with ABI sequencing. The output of the ViewSeq™ program is illustrated with four pretreatment samples and four posttreatment samples (SEQ ID NO:22, SEQ ID NO:23, SEQ ID NO:24, SEQ ID NO:25, SEQ ID NO:26, and SEQ ID NO:27). Note the

^^^P^^M ^ 8 mUt3ti0n ariSen * ^ 8tf,aCent *° a * M0nal rn *** 0n!i m the " 3 " mulalion 
The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. Merely by way of example, while the invention is illustrated with particular reference to the evaluation of DNA (natural or unnatural), the methods can be used in the analysis from chips with other materials synthesized thereon, such as RNA. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.
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SEQUENCE LISTING 



(1) GENERAL INFORMATION: 



(i) APPLICANT: 

(A) NAME: Affymax Technologies N.V. 

(B) STREET: De Ruyderkade 62 

(C) CITY: Curacao 

(E) COUNTRY: Netherlands Antilles 

(F) POSTAL CODE (ZIP) : none 

(ii) TITLE OF INVENTION: Computer-Aided Visualization and
Analysis System for Sequence Evaluation 

(iii) NUMBER OF SEQUENCES: 39 

(iv) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy disk 

(B) COMPUTER: IBM PC compatible 

(C) OPERATING SYSTEM: PC-DOS/MS-DOS

(D) SOFTWARE: PatentIn Release #1.0, Version #1.25 (EPO)



(2) INFORMATION FOR SEQ ID NO: 1:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:
ATGTGGACAG TTGTA 



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(2) INFORMATION FOR SEQ ID NO: 2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2: 
ATGTGGATAG TTGTA 



(2) INFORMATION FOR SEQ ID NO: 3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 
ATGTGGAKAG TTGTA 
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(2) INFORMATION FOR SEQ ID NO: 4: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 
AAAACTGAAA A 



(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 
AAAACCGAAA A 
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(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 
AAACCCAATC CACATCA 



(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7: 
AAACCCAGTC CACATCA 
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(2) INFORMATION FOR SEQ ID NO: 8: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8: 
GGGGAAGCAG ATTTGGGTAC CACCCAAGTA T 



(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 31 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9: 
GGGGAAGCAG ATTTGAAMAC CACCCAAGTA T 
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(2) INFORMATION FOR SEQ ID NO: 10: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:10: 
GCATTAGTAG AGATATGTAC AGAAATGGAA AAGGAAGGGA AAATTTCAAA 



(2) INFORMATION FOR SEQ ID NO: 11: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11: 

AGAGATGGAA AAGGAAGGGA AAATTTCAAA 






EP0717113A2 



(2) INFORMATION FOR SEQ ID NO: 12: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12: 
AATTGGGCC** AGATATGGAG AGRARDGGRA AXXXAAGGGA AAATTNNNAA 



(2) INFORMATION FOR SEQ ID NO: 13: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13: 
GCATTAGTAG AGATATGKAS AGRARDGGRA AXXXAAGGGA AAAKTNNNAA 
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(2) INFORMATION FOR SEQ ID NO: 14: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 
SJJ G AGATATGKAS AGRRRDGGRA AXXXAAGGGA AAADTYNNAA 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 
AATTGGGCC** AGATATGTAS AGRRADGGAA AXGGAAGGGA AAATTNNNNA 
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(2) INFORMATION FOR SEQ ID NO: 16: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 
aattg£gcc G agatatgtac ag *gagggaa AXGGAAGGGA AAATTNNNNA 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO:17: 
AATTGGGCC^ AGATATGTAS AGRGAGGGAA AXGGAAGGGA AAATTNNNNA 
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(2) INFORMATION FOR SEQ ID NO: 18: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 59 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18: 
GCATTAGTAG GAGGNNNGAC AGGGRKGGAA AXXMAAGGGA AAAKTNNNAA 



59 



(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 
ScCTO CTATGTCCTC GTCTACTATG TCATAATCTT CTTTACTTAA 
TTTTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG 

CTATGGACAT CTTTAGACAC CTGTATTTCG ATATCCATGT 

160 



60 
120 



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(2) INFORMATION FOR SEQ ID NO: 20: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION : SEQ ID NO:20: 

NTATGTCCTC GTCYACTATG TNANNNNNNN NNNNNNNNAA 

SZS ^^^^^^ CNNCNTAACC TCCAAAATAN NNNNNNTCTN 

CTANNNGNAG NNNNAGANAR NCCNNNNNNN NNATNCATGT 

160 



60 
120 



(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY : linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO:21: 
SS CTATGTCCTC GTCTACTATG TCATAATNNN NNNNACTTAA 

60 

TXTT ^ aaC CCCCTTAACC TCCAAAATAG TTTCATTCTG 

CTATGNGNNG NNNTAGACAG KCCNNNNTCG ATATCCATGT "! 

160 
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(2) INFORMATION FOR SEQ ID NO: 22: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22: 
ACGGTCCTTT CTATGTCCTC GTCT ACTATG TCATAATCTT CTTTACTTAA 

?S?ag£ TTTTACTATC cnncttaacc tccaaaatag TTTCATTCTG 

CTATGGGTAG CTTTAGACCN CCGTATTTCG ATATCCATGT 160 



60 
120 



(2) INFORMATION FOR SEQ ID NO: 23: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23: 

ac^tcc^ raTCTCCTC gtctactatg tcataatctt ctttacttaa 

60 

tcSSS^ TTTTACTATC CCNCTTA ACC TCCAAAATAG TTTCATTCTG 

120 

CTATGGGTAG CTTTAGACCC CCGTATTTCG ATATCCATGT 160 






(2) INFORMATION FOR SEQ ID NO: 24: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 24: 
ACGGTCC^C NTATGTCCTC GTCYACTATG TCANNNNNCN NNCNNNNCAA 

60 

NNCNNCYANG AANCYCAACC TCCAAAATAN NNNNNNTCTN 

120 

CTNNNNNNAG NGNNAGACAC CTGTATNNNN NTATNCAYGT 160 



(2) INFORMATION FOR SEQ ID NO: 25:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 160 base pairs 
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 






(ii) MOLECULE TYPE: DNA (oligonucleotide) 



(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25: 
acggtS ctamicctc gtctactatg TCATAATCCN NNCNNCTCAA

60 

N^SsT TNYTACTATC CCCCTTAACC TCCAAAATAG TTTCATTCTG 

120 

CTANNNNNAG NGTTAGACAC CTGTATTTCG ATATCCATGT 160






(2) INFORMATION FOR SEQ ID NO:26: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 
(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26: 

IcggS? ctamtc <« gtctactatg tcataatccn NCCTACTCAA 

60 

^TTACTATC CMCCTTAACC TCCAAAATAG TTTCATTCTG 
CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT 

160






(2) INFORMATION FOR SEQ ID NO: 27: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 160 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION : SEQ ID NO: 27: 

CTATGTCC *C GTCTACTATG TCATAATCTT CTTTACYCAA 
TTTTACTATC CCMCTTAACC TCCAAAATAG TTTCATTCTG 
CTATGAGTAG CTTTAGACAC CTGTATTTCG ATATCCATGT 160



60 






(2) INFORMATION FOR SEQ ID NO: 28: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 28: 
AAACCCAATC CACATCM 



(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 17 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 29: 
MMACNCANNC CACANNM 
(2) INFORMATION FOR SEQ ID NO: 30:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 30: 
TTGGGTACCA C 



(2) INFORMATION FOR SEQ ID NO: 31: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 31: 
TTGAAMACCA C 
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(2) INFORMATION FOR SEQ ID NO: 32: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 32: 
ACAGAAATGG A 



(2) INFORMATION FOR SEQ ID NO: 33: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 33: 
AGAGRATDGG R 
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(2) INFORMATION FOR SEQ ID NO: 34: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION : SEQ ID NO: 34: 
ASAGRRADGG A 



(2) INFORMATION FOR SEQ ID NO: 35: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 35: 
ACAGGGRRGG A 
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(2) INFORMATION FOR SEQ ID NO:36: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 11 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:36: 
CTGGGGGGTA T 



11 



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(2) INFORMATION FOR SEQ ID NO:37: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 11 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 37:
CTGGCCSGTG T 



(2) INFORMATION FOR SEQ ID NO: 38: 

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 11 base pairs
(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA (oligonucleotide) 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 38:
CTGGGCGGTA T 



11 
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(2) INFORMATION FOR SEQ ID NO: 39: 
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 11 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: DNA (oligonucleotide)

(Xi) SEQUENCE DESCRIPTION: SEQ ID NO: 39: 

CTGGCACGTG T 

11 
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